
    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs, yet GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup across four GPUs over using only CPUs or a single GPU, while increasing the overall flexibility of Accelerated OpenMP.
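    The migration path the abstract describes can be made concrete with a minimal sketch, assuming OpenMP 4.x target directives; the hetero() clause in the trailing comment is a hypothetical illustration of the kind of scheduling extension proposed, not the paper's verbatim syntax.

```c
/* Minimal sketch of Accelerated OpenMP offload, assuming OpenMP 4.x
 * target directives. The hetero() clause below is a hypothetical
 * illustration of the proposed extension, not the paper's syntax. */
#include <omp.h>

void vec_scale(float *a, const float *b, float k, int n)
{
    /* Today: the entire loop is offloaded to a single accelerator. */
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; i++)
        a[i] = k * b[i];

    /* Proposed direction (hypothetical syntax): let the runtime split
     * iterations across all CPUs and GPUs on the node, e.g.
     *   #pragma omp parallel for hetero(1, adaptive)
     * with the CoreTSAR runtime deciding each device's share dynamically. */
}
```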

    Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

    As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to achieve throughput up to three orders of magnitude (1000×) higher than lock-free and combining queues on GPU platforms and twice (2×) as high on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput.
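    For contrast with the paper's design, the sketch below is a textbook bounded blocking FIFO built on a mutex and condition variables; it shows what "blocking" means here (threads wait rather than spin on failed retries), but it is a generic baseline, not the paper's many-core queue.

```c
/* Generic bounded blocking FIFO queue using pthreads; a baseline
 * illustration only, not the paper's queue design. */
#include <pthread.h>

#define QCAP 1024

typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

void queue_init(queue_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void enqueue(queue_t *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)              /* block until space is free */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

int dequeue(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                 /* block until an item arrives */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}
```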

    A power-measurement methodology for large-scale, high-performance computing

    Improvement in the energy efficiency of supercomputers can be accelerated by improving the quality and comparability of efficiency measurements. The ability to generate accurate measurements at extreme scale is just now emerging. The realization of system-level measurement capabilities can be accelerated with a commonly adopted, high-quality measurement methodology for use while running a workload, typically a benchmark. This paper describes a methodology that has been developed collaboratively through the Energy Efficient HPC Working Group to support architectural analysis and comparative measurements for rankings, such as the Top500 and Green500. To support measurements with varying amounts of effort and equipment, we present three distinct levels of measurement, which provide increasing levels of accuracy. Level 1 is similar to the Green500 run rules today: a single average power measurement extrapolated from a subset of a machine. Level 2 is more comprehensive, but still widely achievable. Level 3 is the most rigorous of the three methodologies but is only possible at a few sites. However, the Level 3 methodology generates a high-quality result that exposes details that the other methodologies may miss. In addition, we present case studies from the Leibniz Supercomputing Centre (LRZ), Argonne National Laboratory (ANL), and Calcul Québec Université Laval that explore the benefits and difficulties of gathering high-quality, system-level measurements on large-scale machines.
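    As an illustration of the Level 1 reduction (a single average power number for a run), the sketch below computes total energy and average power from timestamped power samples via trapezoidal integration; the sample layout and function names are assumptions for illustration, not part of the methodology document.

```c
/* Illustrative reduction from timestamped power samples to total energy
 * and average power; assumed data layout, not the methodology's spec. */
#include <stdio.h>

/* t[] in seconds, p[] in watts; returns average power in watts and,
 * optionally, total energy in joules via energy_out. */
double average_power(const double *t, const double *p, int n,
                     double *energy_out)
{
    /* Trapezoidal integration of power over time yields energy;
     * dividing by elapsed time yields average power. */
    double energy = 0.0;
    for (int i = 1; i < n; i++)
        energy += 0.5 * (p[i] + p[i - 1]) * (t[i] - t[i - 1]);
    if (energy_out)
        *energy_out = energy;
    return energy / (t[n - 1] - t[0]);
}

int main(void)
{
    double t[] = {0.0, 1.0, 2.0, 3.0};          /* seconds */
    double p[] = {200.0, 220.0, 210.0, 205.0};  /* watts   */
    double e;
    double avg = average_power(t, p, 4, &e);
    printf("energy = %.1f J, average power = %.1f W\n", e, avg);
    return 0;
}
```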